64 research outputs found

    Online Domain Adaptation for Multi-Object Tracking

    Full text link
    Automatically detecting, labeling, and tracking objects in videos depends first and foremost on accurate category-level object detectors. These might, however, not always be available in practice, as acquiring high-quality large scale labeled training datasets is either too costly or impractical for all possible real-world application scenarios. A scalable solution consists in re-using object detectors pre-trained on generic datasets. This work is the first to investigate the problem of on-line domain adaptation of object detectors for causal multi-object tracking (MOT). We propose to alleviate the dataset bias by adapting detectors from category to instances, and back: (i) we jointly learn all target models by adapting them from the pre-trained one, and (ii) we also adapt the pre-trained model on-line. We introduce an on-line multi-task learning algorithm to efficiently share parameters and reduce drift, while gradually improving recall. Our approach is applicable to any linear object detector, and we evaluate both cheap "mini-Fisher Vectors" and expensive "off-the-shelf" ConvNet features. We quantitatively measure the benefit of our domain adaptation strategy on the KITTI tracking benchmark and on a new dataset (PASCAL-to-KITTI) we introduce to study the domain mismatch problem in MOT.Comment: To appear at BMVC 201

    Deep Fishing: Gradient Features from Deep Nets

    Full text link
    Convolutional Networks (ConvNets) have recently improved image recognition performance thanks to end-to-end learning of deep feed-forward models from raw pixels. Deep learning is a marked departure from the previous state of the art, the Fisher Vector (FV), which relied on gradient-based encoding of local hand-crafted features. In this paper, we discuss a novel connection between these two approaches. First, we show that one can derive gradient representations from ConvNets in a similar fashion to the FV. Second, we show that this gradient representation actually corresponds to a structured matrix that allows for efficient similarity computation. We experimentally study the benefits of transferring this representation over the outputs of ConvNet layers, and find consistent improvements on the Pascal VOC 2007 and 2012 datasets.Comment: To appear at BMVC 201

    Exploring the Limitations of Behavior Cloning for Autonomous Driving

    Get PDF
    Driving requires reacting to a wide variety of complex environment conditions and agent behaviors. Explicitly modeling each possible scenario is unrealistic. In contrast, imitation learning can, in theory, leverage data from large fleets of human-driven cars. Behavior cloning in particular has been successfully used to learn simple visuomotor policies end-to-end, but scaling to the full spectrum of driving behaviors remains an unsolved problem. In this paper, we propose a new benchmark to experimentally investigate the scalability and limitations of behavior cloning. We show that behavior cloning leads to state-of-the-art results, including in unseen environments, executing complex lateral and longitudinal maneuvers without these reactions being explicitly programmed. However, we confirm well-known limitations (due to dataset bias and overfitting), new generalization issues (due to dynamic objects and the lack of a causal model), and training instability requiring further research before behavior cloning can graduate to real-world driving. The code of the studied behavior cloning approaches can be found at https://github.com/felipecode/coiltraine

    Breaking the "Object" in Video Object Segmentation

    Full text link
    The appearance of an object can be fleeting when it transforms. As eggs are broken or paper is torn, their color, shape and texture can change dramatically, preserving virtually nothing of the original except for the identity itself. Yet, this important phenomenon is largely absent from existing video object segmentation (VOS) benchmarks. In this work, we close the gap by collecting a new dataset for Video Object Segmentation under Transformations (VOST). It consists of more than 700 high-resolution videos, captured in diverse environments, which are 20 seconds long on average and densely labeled with instance masks. A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent. We then extensively evaluate state-of-the-art VOS methods and make a number of important discoveries. In particular, we show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues. This motivates us to propose a few modifications for the top-performing baseline that improve its capabilities by better modeling spatio-temporal information. But more broadly, the hope is to stimulate discussion on learning more robust video object representations

    Activity representation with motion hierarchies

    Get PDF
    International audienceComplex activities, e.g., pole vaulting, are composed of a variable number of sub-events connected by complex spatio-temporal relations, whereas simple actions can be represented as sequences of short temporal parts. In this paper, we learn hierarchical representations of activity videos in an unsupervised manner. These hierarchies of mid-level motion components are data-driven decompositions specific to each video. We introduce a spectral divisive clustering algorithm to efficiently extract a hierarchy over a large number of tracklets (i.e., local trajectories). We use this structure to represent a video as an unordered binary tree. We model this tree using nested histograms of local motion features. We provide an efficient positive definite kernel that computes the structural and visual similarity of two hierarchical decompositions by relying on models of their parent-child relations. We present experimental results on four recent challenging benchmarks: the High Five dataset [Patron-Perez et al, 2010], the Olympics Sports dataset [Niebles et al, 2010], the Hollywood 2 dataset [Marszalek et al, 2009], and the HMDB dataset [Kuehne et al, 2011]. We show that pervideo hierarchies provide additional information for activity recognition. Our approach improves over unstructured activity models, baselines using other motion decomposition algorithms, and the state of the art

    Modèles structurés pour la reconnaissance d'actions dans des vidéos réalistes

    Get PDF
    Cette thèse décrit de nouveaux modèles pour la reconnaissance de catégories d'actions comme "ouvrir une porte" ou "courir" dans des vidéos réalistes telles que les films. Nous nous intéressons tout particulièrement aux propriétés structurelles des actions : comment les décomposer, quelle en est la structure caractéristique et comment utiliser cette information afin de représenter le contenu d'une vidéo. La difficulté principale à laquelle nos modèles s'attellent réside dans la satisfaction simultanée de deux contraintes antagonistes. D'une part, nous devons précisément modéliser les aspects discriminants d'une action afin de pouvoir clairement identifier les différences entre catégories. D'autre part, nos représentations doivent être robustes en conditions réelles, c'est-à-dire dans des vidéos réalistes avec de nombreuses variations visuelles en termes d'acteurs, d'environnements et de points de vue. Dans cette optique, nous proposons donc trois modèles précis et robustes à la fois, qui capturent les relations entre parties d'actions ainsi que leur contenu. Notre approche se base sur des caractéristiques locales - notamment les points d'intérêts spatio-temporels et le flot optique - et a pour objectif d'organiser l'ensemble des descripteurs locaux décrivant une vidéo. Nous proposons aussi des noyaux permettant de comparer efficacement les représentations structurées que nous introduisons. Bien que nos modèles se basent tous sur les principes mentionnés ci-dessus, ils différent de par le type de problème traité et la structure sur laquelle ils reposent. Premièrement, nous proposons de modéliser une action par une séquence de parties temporelles atomiques correspondant à une décomposition sémantique. De plus, nous décrivons comment apprendre un modèle flexible de la structure temporelle dans le but de localiser des actions dans des vidéos de longue durée. Deuxièmement, nous étendons nos idées à l'estimation et à la représentation de la structure spatio-temporelle d'activités plus complexes. Nous décrivons un algorithme d'apprentissage non supervisé permettant de dégager automatiquement une décomposition hiérarchique du contenu dynamique d'une vidéo. Nous utilisons la structure arborescente qui en résulte pour modéliser une action de manière hiérarchique. Troisièmement, au lieu de comparer des modèles structurés, nous explorons une autre alternative : directement comparer des modèles de structure. Pour cela, nous représentons des actions de courte durée comme des séries temporelles en haute dimension et étudions comment la dynamique temporelle d'une action peut être utilisée pour améliorer les performances des modèles non structurés formant l'état de l'art en reconnaissance d'actions. Dans ce but, nous proposons un noyau calculant de manière efficace la similarité entre les dépendances temporelles respectives de deux actions. Nos trois approches et leurs assertions sont à chaque fois validées par des expériences poussées sur des bases de données publiques parmi les plus difficiles en reconnaissance d'actions. Nos résultats sont significativement meilleurs que ceux de l'état de l'art, illustrant ainsi à quel point la structure des actions est importante afin de bâtir des modèles précis et robustes pour la reconnaissance d'actions dans des vidéos réalistes.This dissertation introduces novel models to recognize broad action categories - like "opening a door" and "running" - in real-world video data such as movies and internet videos. In particular, we investigate how an action can be decomposed, what is its discriminative structure, and how to use this information to accurately represent video content. The main challenge we address lies in how to build models of actions that are simultaneously information-rich - in order to correctly differentiate between different action categories - and robust to the large variations in actors, actions, and videos present in real-world data. We design three robust models capturing both the content of and the relations between action parts. Our approach consists in structuring collections of robust local features - such as spatio-temporal interest points and short-term point trajectories. We also propose efficient kernels to compare our structured action representations. Even if they share the same principles, our methods differ in terms of the type of problem they address and the structure information they rely on. We, first, propose to model a simple action as a sequence of meaningful atomic temporal parts. We show how to learn a flexible model of the temporal structure and how to use it for the problem of action localization in long unsegmented videos. Extending our ideas to the spatio-temporal structure of more complex activities, we, then, describe a large-scale unsupervised learning algorithm used to hierarchically decompose the motion content of videos. We leverage the resulting tree-structured decompositions to build hierarchical action models and provide an action kernel between unordered binary trees of arbitrary sizes. Instead of structuring action models, we, finally, explore another route: directly comparing models of the structure. We view short-duration actions as high-dimensional time-series and investigate how an action's temporal dynamics can complement the state-of-the-art unstructured models for action classification. We propose an efficient kernel to compare the temporal dependencies between two actions and show that it provides useful complementary information to the traditional bag-of-features approach. In all three cases, we conducted thorough experiments on some of the most challenging benchmarks used by the action recognition community. We show that each of our methods significantly outperforms the related state of the art, thus highlighting the importance of structure information for accurate and robust action recognition in real-world videos.SAVOIE-SCD - Bib.électronique (730659901) / SudocGRENOBLE1/INP-Bib.électronique (384210012) / SudocGRENOBLE2/3-Bib.électronique (384219901) / SudocSudocFranceF
    • …
    corecore